An Expanded Taxonomy of Semiotic Classes for Text Normalization

Authors

  • Daan van Esch
  • Richard Sproat
Abstract

We describe an expanded taxonomy of semiotic classes for text normalization, building upon the work in [1]. We add a large number of categories of non-standard words (NSWs) that we believe a robust real-world text normalization system will have to be able to process. Our new categories are based upon empirical findings from building text normalization systems across many languages, for both speech recognition and speech synthesis purposes. We believe our new taxonomy is useful both for ensuring high coverage when writing manual grammars and for eliciting training data to build machine learning-based text normalization systems.
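
As a rough illustration of what assigning semiotic classes to non-standard words can look like in practice, the sketch below tags whitespace-separated tokens with a class label using regular expressions. The class names (TIME, DATE, MONEY, CARDINAL, PLAIN), the patterns, and the function names are hypothetical choices made for this example; they are not the taxonomy proposed in the paper, and a real system would need far broader coverage.

```python
import re

# Hypothetical, illustrative class inventory. These labels and patterns are
# assumptions for this sketch, NOT the categories defined in the paper.
# Order matters: more specific patterns are tried before CARDINAL.
PATTERNS = [
    ("TIME", re.compile(r"\d{1,2}:\d{2}")),          # e.g. 9:30
    ("DATE", re.compile(r"\d{4}-\d{2}-\d{2}")),      # e.g. 2017-08-20
    ("MONEY", re.compile(r"[$€£]\d+(?:\.\d+)?")),    # e.g. $5 or €2.50
    ("CARDINAL", re.compile(r"\d+")),                # e.g. 3
]


def classify(token):
    """Return the first class whose pattern matches the whole token."""
    for label, pattern in PATTERNS:
        if pattern.fullmatch(token):
            return label
    return "PLAIN"  # ordinary word: needs no normalization in this sketch


def tag(text):
    """Split on whitespace and pair each token with its class label."""
    return [(tok, classify(tok)) for tok in text.split()]


if __name__ == "__main__":
    for tok, label in tag("Pay $5 by 2017-08-20 at 9:30 for 3 items"):
        print(tok, label)
```

A fixed, ordered pattern list like this is easy to audit for coverage, which is the use the abstract has in mind for manual grammars; the same labels could also serve as annotation targets when eliciting training data.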

Similar resources

Text Normalization System for Bangla

This paper describes a text normalization system for the Bangla language (exonym: Bengali) that identifies the semiotic classes from a Bangla text corpus. After identifying the semiotic classes, a set of rules was written for tokenization and verbalization. This study is important for Text-to-Speech (TTS) systems as well as for creating a language model used in speech recognition.

Normalization of Non-Standard Words in Croatian Texts

This paper presents text normalization, which is an integral part of any text-to-speech synthesis system. Text normalization is a set of methods whose task is to write non-standard words, such as numbers, dates, times, abbreviations, acronyms and the most common symbols, in their full expanded form. The whole taxonomy for classification of non-standard words in Croatian language togethe...

Myanmar Number Normalization for Text-to-Speech

Text normalization is an essential module of a Text-to-Speech (TTS) system, as TTS systems need to work on real text. This paper describes Myanmar number normalization designed for a Myanmar Text-to-Speech system. Semiotic classes for the Myanmar language are identified by the study of a Myanmar text corpus, and Weighted Finite State Transducer (WFST) based Myanmar number normalization is implemented. N...

Semiotic Analysis of Written Signs in the Road Sign Systems of Tehran City

Introduction: As a component of the urban landscape, road sign systems are among the most critical elements of urban environments. Generally speaking, written signs dominate the design of these systems. These signs can also foster aesthetic and visual pleasure compellingly and innovatively. Furthermore, they perpetuate a specific image in the minds of their observers. This research seeks to...

Evaluation the theories of semiotics approach in the Reading of Architecture and Urbanism

This essay attempts to show how semiotic studies can be used as a perceptual aspect in reading architecture and urbanism. The appearance of each art is similar to the creation of a “text” that carries a set of customs, values and thought with it. The production of each “text” is based on the context, culture and intellectual foundation of its society of origin. Each text is an ...

Publication date: 2017